Good Data Is All Imitation Learning Needs
Amir Samadi, Konstantinos Koufos, Kurt Debattista, Mehrdad Dianati
In this paper, we address the limitations of traditional teacher-student models, imitation learning, and behaviour cloning in the context of Autonomous/Automated Driving Systems (ADS), where these methods often struggle with incomplete coverage of real-world scenarios. To enhance the robustness of such models, we introduce Counterfactual Explanations (CFEs) as a novel data augmentation technique for end-to-end ADS. By generating training samples near decision boundaries through minimal input modifications, CFEs lead to a more comprehensive representation of expert driver strategies, particularly in safety-critical scenarios. This approach therefore helps improve the model's ability to handle rare and challenging driving events, such as anticipating pedestrians darting out, ultimately leading to safer and more trustworthy decision-making for ADS. Our experiments in the CARLA simulator demonstrate that the resulting model, CF-Driver, outperforms the current state-of-the-art method, achieving a higher driving score and lower infraction rates. Specifically, CF-Driver attains a driving score of 84.2, surpassing the previous best model by 15.02 percentage points. These results highlight the effectiveness of incorporating CFEs in training end-to-end ADS. To foster further research, the CF-Driver code is made publicly available.
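The core mechanism (generating a minimally modified input that just crosses the model's decision boundary) can be sketched for a toy linear policy. Everything below, including the weights and the brake/no-brake framing, is an illustrative assumption, not CF-Driver's actual model:

```python
import numpy as np

# Toy linear "brake / don't brake" policy: brake when w @ x + b > 0.
# w, b, and the feature layout are illustrative stand-ins.
w = np.array([2.0, -1.0])
b = -0.5

def counterfactual(x, step=0.05, max_iter=200):
    """Walk x in small steps along the decision gradient until the
    policy's decision flips, yielding a near-boundary training sample."""
    original = np.sign(w @ x + b)
    x_cf = x.astype(float).copy()
    direction = -original * w / np.linalg.norm(w)  # toward the other class
    for _ in range(max_iter):
        if np.sign(w @ x_cf + b) != original:
            break  # decision flipped: x_cf is a counterfactual
        x_cf += step * direction
    return x_cf
```

Samples produced this way sit just across the boundary while staying close to the original input, which is what makes them useful as augmentation for safety-critical corner cases.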
Model-Based Data-Centric AI: Bridging the Divide Between Academic Ideals and Industrial Pragmatism
Chanjun Park, Minsoo Khang, Dahyun Kim
This paper delves into the contrasting roles of data within academic and industrial spheres, highlighting the divergence between Data-Centric AI and Model-Agnostic AI approaches. We argue that while Data-Centric AI focuses on the primacy of high-quality data for model performance, Model-Agnostic AI prioritizes algorithmic flexibility, often at the expense of data quality considerations. This distinction reveals that academic standards for data quality frequently do not meet the rigorous demands of industrial applications, leading to potential pitfalls in deploying academic models in real-world settings. Through a comprehensive analysis, we address these disparities, presenting both the challenges they pose and strategies for bridging the gap. Furthermore, we propose a novel paradigm: Model-Based Data-Centric AI, which aims to reconcile these differences by integrating model considerations into data optimization processes. This approach underscores the necessity for evolving data requirements that are sensitive to the nuances of both academic research and industrial deployment. By exploring these discrepancies, we aim to foster a more nuanced understanding of data's role in AI development and encourage a convergence of academic and industrial standards to enhance AI's real-world applicability.
AI Is Running Circles Around Robotics
When people imagine the AI apocalypse, they generally imagine robots. But the robot-takeover scenario most often envisioned by science fiction is not exactly looming. Recent and explosive progress in AI--along with recent and explosive hype surrounding it--has made the existential risks posed by the technology a topic of mainstream conversation. Yet progress in robotics--which is to say, machines capable of interacting with the physical world through motion and perception--has been lagging way behind. "I can't help but feel a little envious," said Eric Jang, the vice president of AI at the humanoid-robotics company 1X, in a talk at a robotics conference last year.
Cleanlab: Correct your data labels automatically and quickly – Towards AI
Originally published on Towards AI. I used an open-source library, cleanlab, to remove low-quality labels from an image dataset. The model trained on the dataset without the low-quality data gained 4 percentage points of accuracy over the baseline model (trained on all data). Improving data quality sounds easy enough, but the workload of manually checking data quality can quickly become insurmountable as a dataset scales.
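The confident-learning idea behind cleanlab can be sketched in plain NumPy: flag samples whose given label receives low out-of-sample predicted probability. This is a simplified stand-in with a single hypothetical global threshold, not cleanlab's actual per-class thresholds or API:

```python
import numpy as np

def flag_label_issues(labels, pred_probs, threshold=0.5):
    """Flag samples whose given label receives low predicted probability.

    labels:     (n,) integer class labels as annotated
    pred_probs: (n, k) out-of-sample predicted probabilities
                (e.g. from cross-validation, so the model never
                scores a sample it was trained on)
    Returns indices of likely label issues.
    """
    self_conf = pred_probs[np.arange(len(labels)), labels]
    return np.where(self_conf < threshold)[0]
```

Dropping (or re-reviewing) the flagged indices before retraining is the workflow the article describes: the model effectively audits its own training labels.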
Good Data from Bad Models: Foundations of Threshold-based Auto-labeling
Harit Vishwakarma, Heguang Lin, Frederic Sala, Ramya Korlakai Vinayak
Creating large-scale high-quality labeled datasets is a major bottleneck in supervised machine learning workflows. Auto-labeling systems are a promising way to reduce reliance on manual labeling for dataset construction. Threshold-based auto-labeling, where validation data obtained from humans is used to find a threshold for confidence above which the data is machine-labeled, is emerging as a popular solution used widely in practice. Given the long shelf-life and diverse usage of the resulting datasets, understanding when the data obtained by such auto-labeling systems can be relied on is crucial. In this work, we analyze threshold-based auto-labeling systems and derive sample complexity bounds on the amount of human-labeled validation data required for guaranteeing the quality of machine-labeled data. Our results provide two insights. First, reasonable chunks of the unlabeled data can be automatically and accurately labeled by seemingly bad models. Second, a hidden downside of threshold-based auto-labeling systems is potentially prohibitive validation data usage. Together, these insights describe the promise and pitfalls of using such systems. We validate our theoretical guarantees with simulations and study the efficacy of threshold-based auto-labeling on real datasets.
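The threshold-based scheme the paper analyzes can be sketched as follows. Function names and the target-accuracy parameter are illustrative, not the paper's notation:

```python
import numpy as np

def fit_threshold(val_conf, val_correct, target_acc=0.95):
    """Use human-labeled validation data to pick a confidence threshold.

    val_conf:    (n,) model confidence on validation points
    val_correct: (n,) 1 if the model's prediction matched the human label
    Returns the smallest threshold whose accepted region still meets
    target_acc, or None if no threshold is safe.
    """
    order = np.argsort(-val_conf)                      # most confident first
    conf, correct = val_conf[order], val_correct[order]
    cum_acc = np.cumsum(correct) / np.arange(1, len(correct) + 1)
    ok = np.where(cum_acc >= target_acc)[0]
    if len(ok) == 0:
        return None
    return conf[ok.max()]  # confidence of the last point in the largest safe prefix

def auto_label(unl_conf, unl_pred, threshold):
    """Machine-label only the unlabeled points at or above the threshold."""
    mask = unl_conf >= threshold
    return mask, unl_pred[mask]
```

The paper's sample-complexity bounds concern exactly how much human-labeled validation data `fit_threshold` needs before the accepted region's accuracy guarantee can be trusted, which is where the "hidden downside" of validation-data usage appears.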
Efficient Medical Image Assessment via Self-supervised Learning
Chun-Yin Huang, Qi Lei, Xiaoxiao Li
High-performance deep learning methods typically rely on large annotated training datasets, which are difficult to obtain in many clinical applications due to the high cost of medical image labeling. Existing data assessment methods commonly require knowing the labels in advance, which is not feasible for our goal of 'knowing which data to label.' To this end, we formulate and propose a novel and efficient data assessment strategy, the EXponentiAl Marginal sINgular valuE (EXAMINE) score, to rank the quality of unlabeled medical image data based on useful latent representations extracted via Self-supervised Learning (SSL) networks. Motivated by the theoretical implications of the SSL embedding space, we leverage a Masked Autoencoder for feature extraction. We then evaluate data quality by the marginal change of the largest singular value after excluding a data point from the dataset. We conduct extensive experiments on a pathology dataset. Our results indicate the effectiveness and efficiency of the proposed method for selecting the most valuable data to label.
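The marginal-singular-value criterion can be sketched directly: score each point by how much the largest singular value of the embedding matrix drops when that point is removed. This brute-force version runs one SVD per point; the paper derives a more efficient estimate, and the function name is illustrative:

```python
import numpy as np

def marginal_sv_scores(embeddings):
    """Score each row of the (n, d) SSL embedding matrix by the drop in
    the largest singular value when that row is excluded.

    Higher score = the point contributes more to the dominant direction
    of the embedding space, i.e. it is more 'valuable' to label under
    this criterion.
    """
    sigma_full = np.linalg.svd(embeddings, compute_uv=False)[0]
    scores = np.empty(len(embeddings))
    for i in range(len(embeddings)):
        reduced = np.delete(embeddings, i, axis=0)
        scores[i] = sigma_full - np.linalg.svd(reduced, compute_uv=False)[0]
    return scores
```

On a toy matrix where one row dominates the spectrum, that row receives the largest score, matching the intuition that removing an influential point changes the top singular value most.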
Pinaki Laskar on LinkedIn: #MLOps #AI #machinelearning
Why is #MLOps the key to a productionized ML system? ML model code is only a small part (5–10%) of a successful ML system, and the objective should be to create value by placing ML models into production. Data scientists tend to optimize for model metrics (e.g. F1 score) while stakeholders focus on business metrics. Improving labelling consistency is an iterative process, so consider repeating the process until disagreements are resolved as far as possible. For instance, partial automation with a human in the loop can be an ideal design for AI-based interpretation of medical scans, with human judgement coming in for cases where prediction confidence is low.
Good data is a key component to AI innovation and machine learning
When the Biden Administration launched an AI task force earlier this month to create a path to "democratize access to research tools to promote AI," the goal of access was paramount. "The task force consists of some of the top experts in academia and industry," said Dinesh Manocha, a professor of computer science and electrical and computer engineering at the University of Maryland, on Federal Monthly Insights – Repurposing Manpower through Automation. "They recognize the importance and they're pushing for more development in the field by making good data available. So data is a very key component of AI and machine learning-based methods."

Manocha said AI is as old as the field, pointing to "Founding Father" Alan Turing, who he said laid the foundations in the 1950s. "Machine learning is one sub-area in the broader field of AI," said Manocha on Federal Drive with Tom Temin. "All the recent developments in AI, all the penetration in the real world, has primarily been driven by the excitement in the last five to 10 years from machine learning."

The breadth of AI and machine learning is quite evident simply from looking at the classes offered at the University of Maryland and what students want to study. "A lot of computer science majors want to take AI," Manocha said. "Machine learning, by itself, has become such an important subtopic that we even offer multiple classes in it at the undergraduate and graduate levels."

Focusing on data and algorithms, AI and machine learning imitate the way humans learn. "One of the grand challenges in AI is how we can emulate human-like intelligence, which is still a big open problem," Manocha said. "There have been a lot of approaches pursued and proposed by wonderful researchers over the last 50 to 60 years." Manocha pointed to great advances in AI and machine learning in the sub-branch of "deep learning," which imitates human knowledge and thinking. But there is still a ways to go.
"If you have some data, you can easily get 50%-to-70% accuracy," Manocha said. "To go from 50-to-70% to 90%, you get 100x more data."
Smart Water: Data Labeling with Active Learning And H2O.ai
Data is the food for AI. For machine learning, or supervised learning, golden labels are key for models to recognize the patterns within the data. However, in real-world data it is usually hard to get large amounts of labeled data, for example in search relevance, news topics, and autopilot. Recently, Andrew Ng gave a talk on MLOps: From Model-centric to Data-centric AI, where he discussed the idea of moving from Big Data to Good Data. Good data is defined consistently and covers the important cases.
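One common active-learning strategy for choosing which data to label next is uncertainty sampling, sketched below in plain NumPy. The function name and batch size are illustrative; the article's actual H2O.ai pipeline may differ:

```python
import numpy as np

def uncertainty_sample(pred_probs, batch_size=10):
    """Pick the unlabeled points the current model is least sure about.

    pred_probs: (n, k) class probabilities over the unlabeled pool.
    Returns indices of the batch_size lowest-confidence points, which
    go to human annotators next; the model is then retrained and the
    loop repeats.
    """
    confidence = pred_probs.max(axis=1)       # top-class probability per point
    return np.argsort(confidence)[:batch_size]
```

Labeling near-boundary, low-confidence points first is what lets active learning reach a target accuracy with far fewer golden labels than labeling the pool at random.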